I have chosen to do an analysis about the history of baseball because I would want to have a broader understanding of the sport and to have more diverse topics. For this project, we are going to perform analysis in different divisions to determine if there are correlations with attendance, performance, salary, and size.
library(knitr)
library(MASS)
library(tibble)
library(markdown)
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(dplyr)
library(lubridate)
library(broom)
library(dotwhisker)
library(cem)
library(car)
library(plyr)
library(tidyr)
library(scales)
library(plotly)
library(DT)
attendance <- read.csv("./teams.csv",na.strings=c("","NA"))
pitching <- read.csv("./pitching.csv")
player_info <- read.csv("./player.csv")
player_salary <- read.csv("./salaries.csv")
batters <- read.csv("./batting.csv", stringsAsFactors = FALSE)
fielding <- read_csv("./fielding.csv",col_types = cols())
head(attendance, 5)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 <NA> BS1 BNA <NA> 3 31 NA 20 10 <NA> <NA> N
## 2 1871 <NA> CH1 CNA <NA> 2 28 NA 19 9 <NA> <NA> N
## 3 1871 <NA> CL1 CFC <NA> 8 29 NA 10 19 <NA> <NA> N
## 4 1871 <NA> FW1 KEK <NA> 7 19 NA 7 12 <NA> <NA> N
## 5 1871 <NA> NY2 NNA <NA> 5 33 NA 16 17 <NA> <NA> N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 <NA> 401 1372 426 70 37 3 60 19 73 NA NA NA 303 109 3.55 22 1 3
## 2 <NA> 302 1196 323 52 21 10 60 22 69 NA NA NA 241 77 2.76 25 0 1
## 3 <NA> 249 1186 328 35 40 7 26 25 18 NA NA NA 341 116 4.11 23 0 0
## 4 <NA> 137 746 178 19 8 2 33 9 16 NA NA NA 243 97 5.17 19 1 0
## 5 <NA> 302 1404 403 43 21 1 33 15 46 NA NA NA 313 121 3.72 32 1 0
## IPouts HA HRA BBA SOA E DP FP name
## 1 828 367 2 42 23 225 NA 0.83 Boston Red Stockings
## 2 753 308 6 28 22 218 NA 0.82 Chicago White Stockings
## 3 762 346 13 53 34 223 NA 0.81 Cleveland Forest Citys
## 4 507 261 5 21 17 163 NA 0.80 Fort Wayne Kekiongas
## 5 879 373 7 42 22 227 NA 0.83 New York Mutuals
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 South End Grounds I NA 103 98 BOS BS1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 National Association Grounds NA 96 100 CLE CL1
## 4 Hamilton Field NA 101 107 KEK FW1
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## teamIDretro
## 1 BS1
## 2 CH1
## 3 CL1
## 4 FW1
## 5 NY2
head(pitching, 5)
## playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER
## 1 bechtge01 1871 1 PH1 <NA> 1 2 3 3 2 0 0 78 43 23
## 2 brainas01 1871 1 WS3 <NA> 12 15 30 30 30 0 0 792 361 132
## 3 fergubo01 1871 1 NY2 <NA> 0 0 1 0 0 0 0 3 8 3
## 4 fishech01 1871 1 RC1 <NA> 4 16 24 24 22 1 0 639 295 103
## 5 fleetfr01 1871 1 NY2 <NA> 0 1 1 1 1 0 0 27 20 10
## HR BB SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP
## 1 0 11 1 NA 7.96 NA NA NA 0 NA NA 42 NA NA NA
## 2 4 37 13 NA 4.50 NA NA NA 0 NA NA 292 NA NA NA
## 3 0 0 0 NA 27.00 NA NA NA 0 NA NA 9 NA NA NA
## 4 3 31 15 NA 4.35 NA NA NA 0 NA NA 257 NA NA NA
## 5 0 3 0 NA 10.00 NA NA NA 0 NA NA 21 NA NA NA
head(player_info, 5)
## playerID birthYear birthMonth birthDay birthCountry birthState
## 1 aardsda01 1981 12 27 USA CO
## 2 aaronha01 1934 2 5 USA AL
## 3 aaronto01 1939 8 5 USA AL
## 4 aasedo01 1954 9 8 USA CA
## 5 abadan01 1972 8 25 USA FL
## birthCity deathYear deathMonth deathDay deathCountry deathState
## 1 Denver NA NA NA
## 2 Mobile NA NA NA
## 3 Mobile 1984 8 16 USA GA
## 4 Orange NA NA NA
## 5 Palm Beach NA NA NA
## deathCity nameFirst nameLast nameGiven weight height bats throws
## 1 David Aardsma David Allan 220 75 R R
## 2 Hank Aaron Henry Louis 180 72 R R
## 3 Atlanta Tommie Aaron Tommie Lee 190 75 R R
## 4 Don Aase Donald William 190 75 R R
## 5 Andy Abad Fausto Andres 184 73 L L
## debut finalGame retroID bbrefID
## 1 2004-04-06 2015-08-23 aardd001 aardsda01
## 2 1954-04-13 1976-10-03 aaroh101 aaronha01
## 3 1962-04-10 1971-09-26 aarot101 aaronto01
## 4 1977-07-26 1990-10-03 aased001 aasedo01
## 5 2001-09-10 2006-04-13 abada001 abadan01
head(player_salary, 5)
## yearID teamID lgID playerID salary
## 1 1985 ATL NL barkele01 870000
## 2 1985 ATL NL bedrost01 550000
## 3 1985 ATL NL benedbr01 545000
## 4 1985 ATL NL campri01 633333
## 5 1985 ATL NL ceronri01 625000
head(batters, 5)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB
## 1 abercda01 1871 1 TRO <NA> 1 4 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 <NA> 25 118 30 32 6 0 0 13 8 1 4
## 3 allisar01 1871 1 CL1 <NA> 29 137 28 40 4 5 0 19 3 1 2
## 4 allisdo01 1871 1 WS3 <NA> 27 133 28 44 10 2 2 27 1 1 0
## 5 ansonca01 1871 1 RC1 <NA> 25 120 29 39 11 3 0 16 6 2 2
## SO IBB HBP SH SF GIDP
## 1 0 NA NA NA NA NA
## 2 0 NA NA NA NA NA
## 3 5 NA NA NA NA NA
## 4 2 NA NA NA NA NA
## 5 1 NA NA NA NA NA
head(fielding, 5)
## # A tibble: 5 x 18
## playerID yearID stint teamID lgID POS G GS InnOuts PO
## <chr> <int> <int> <chr> <chr> <chr> <int> <chr> <chr> <int>
## 1 abercda01 1871 1 TRO <NA> SS 1 <NA> <NA> 1
## 2 addybo01 1871 1 RC1 <NA> 2B 22 <NA> <NA> 67
## 3 addybo01 1871 1 RC1 <NA> SS 3 <NA> <NA> 8
## 4 allisar01 1871 1 CL1 <NA> 2B 2 <NA> <NA> 1
## 5 allisar01 1871 1 CL1 <NA> OF 29 <NA> <NA> 51
## # ... with 8 more variables: A <int>, E <int>, DP <int>, PB <int>,
## # WP <chr>, SB <chr>, CS <chr>, ZR <chr>
str(attendance)
## 'data.frame': 2805 obs. of 48 variables:
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
## $ lgID : Factor w/ 6 levels "AA","AL","FL",..: NA NA NA NA NA NA NA NA NA NA ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 24 31 39 56 90 97 111 136 142 8 ...
## $ franchID : Factor w/ 120 levels "ALT","ANA","ARI",..: 13 36 25 56 70 85 91 109 77 9 ...
## $ divID : Factor w/ 3 levels "C","E","W": NA NA NA NA NA NA NA NA NA NA ...
## $ Rank : int 3 2 8 7 5 1 9 6 4 2 ...
## $ G : int 31 28 29 19 33 28 25 29 32 58 ...
## $ Ghome : int NA NA NA NA NA NA NA NA NA NA ...
## $ W : int 20 19 10 7 16 21 4 13 15 35 ...
## $ L : int 10 9 19 12 17 7 21 15 15 19 ...
## $ DivWin : Factor w/ 2 levels "N","Y": NA NA NA NA NA NA NA NA NA NA ...
## $ WCWin : Factor w/ 2 levels "N","Y": NA NA NA NA NA NA NA NA NA NA ...
## $ LgWin : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 1 1 1 ...
## $ WSWin : Factor w/ 2 levels "N","Y": NA NA NA NA NA NA NA NA NA NA ...
## $ R : int 401 302 249 137 302 376 231 351 310 617 ...
## $ AB : int 1372 1196 1186 746 1404 1281 1036 1248 1353 2576 ...
## $ H : int 426 323 328 178 403 410 274 384 375 747 ...
## $ X2B : int 70 52 35 19 43 66 44 51 54 94 ...
## $ X3B : int 37 21 40 8 21 27 25 34 26 35 ...
## $ HR : int 3 10 7 2 1 9 3 6 6 14 ...
## $ BB : int 60 60 26 33 33 46 38 49 48 27 ...
## $ SO : int 19 22 25 9 15 23 30 19 13 28 ...
## $ SB : int 73 69 18 16 46 56 53 62 48 35 ...
## $ CS : int NA NA NA NA NA NA NA NA NA 15 ...
## $ HBP : int NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ RA : int 303 241 341 243 313 266 287 362 303 434 ...
## $ ER : int 109 77 116 97 121 137 108 153 137 173 ...
## $ ERA : num 3.55 2.76 4.11 5.17 3.72 4.95 4.3 5.51 4.37 3.02 ...
## $ CG : int 22 25 23 19 32 27 23 28 32 48 ...
## $ SHO : int 1 0 0 1 1 0 1 0 0 1 ...
## $ SV : int 3 1 0 0 0 0 0 0 0 1 ...
## $ IPouts : int 828 753 762 507 879 747 678 750 846 1545 ...
## $ HA : int 367 308 346 261 373 329 315 431 371 566 ...
## $ HRA : int 2 6 13 5 7 3 3 4 4 3 ...
## $ BBA : int 42 28 53 21 42 53 34 75 45 63 ...
## $ SOA : int 23 22 34 17 22 16 16 12 13 0 ...
## $ E : int 225 218 223 163 227 194 220 198 217 432 ...
## $ DP : int NA NA NA NA NA NA NA NA NA NA ...
## $ FP : num 0.83 0.82 0.81 0.8 0.83 0.84 0.82 0.84 0.85 0.82 ...
## $ name : Factor w/ 139 levels "Altoona Mountain City",..: 17 42 51 63 93 97 111 131 135 5 ...
## $ park : Factor w/ 212 levels "23rd Street Grounds",..: 169 196 115 76 198 86 3 79 129 123 ...
## $ attendance : int NA NA NA NA NA NA NA NA NA NA ...
## $ BPF : int 103 104 96 101 90 102 97 101 94 106 ...
## $ PPF : int 98 102 100 107 88 98 99 100 98 102 ...
## $ teamIDBR : Factor w/ 101 levels "ALT","ANA","ARI",..: 10 21 25 42 62 4 77 93 65 6 ...
## $ teamIDlahman45: Factor w/ 148 levels "ALT","ANA","ARI",..: 24 31 39 56 89 96 110 135 140 8 ...
## $ teamIDretro : Factor w/ 149 levels "ALT","ANA","ARI",..: 24 31 39 56 89 96 110 135 141 8 ...
str(pitching)
## 'data.frame': 44139 obs. of 30 variables:
## $ playerID: Factor w/ 9126 levels "aardsda01","aasedo01",..: 465 837 2502 2576 2608 2626 4968 5180 5255 5430 ...
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 97 142 90 111 90 136 111 56 97 136 ...
## $ lgID : Factor w/ 6 levels "AA","AL","FL",..: NA NA NA NA NA NA NA NA NA NA ...
## $ W : int 1 12 0 4 0 0 0 6 18 12 ...
## $ L : int 2 15 0 16 1 0 1 11 5 15 ...
## $ G : int 3 30 1 24 1 1 3 19 25 29 ...
## $ GS : int 3 30 0 24 1 0 1 19 25 29 ...
## $ CG : int 2 30 0 22 1 0 1 19 25 28 ...
## $ SHO : int 0 0 0 1 0 0 0 1 0 0 ...
## $ SV : int 0 0 0 0 0 0 0 0 0 0 ...
## $ IPouts : int 78 792 3 639 27 3 39 507 666 747 ...
## $ H : int 43 361 8 295 20 1 20 261 285 430 ...
## $ ER : int 23 132 3 103 10 0 5 97 113 153 ...
## $ HR : int 0 4 0 3 0 0 0 5 3 4 ...
## $ BB : int 11 37 0 31 3 0 3 21 40 75 ...
## $ SO : int 1 13 0 15 0 0 1 17 15 12 ...
## $ BAOpp : num NA NA NA NA NA NA NA NA NA NA ...
## $ ERA : num 7.96 4.5 27 4.35 10 0 3.46 5.17 4.58 5.53 ...
## $ IBB : int NA NA NA NA NA NA NA NA NA NA ...
## $ WP : int NA NA NA NA NA NA NA NA NA NA ...
## $ HBP : int NA NA NA NA NA NA NA NA NA NA ...
## $ BK : int 0 0 0 0 0 0 0 2 0 0 ...
## $ BFP : int NA NA NA NA NA NA NA NA NA NA ...
## $ GF : int NA NA NA NA NA NA NA NA NA NA ...
## $ R : int 42 292 9 257 21 0 30 243 223 362 ...
## $ SH : int NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ GIDP : int NA NA NA NA NA NA NA NA NA NA ...
str(player_info)
## 'data.frame': 18846 obs. of 24 variables:
## $ playerID : Factor w/ 18846 levels "aardsda01","aaronha01",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ birthYear : int 1981 1934 1939 1954 1972 1985 1854 1877 1869 1866 ...
## $ birthMonth : int 12 2 8 9 8 12 11 4 11 10 ...
## $ birthDay : int 27 5 5 8 25 17 4 15 11 14 ...
## $ birthCountry: Factor w/ 53 levels "","Afghanistan",..: 50 50 50 50 50 18 50 50 50 50 ...
## $ birthState : Factor w/ 246 levels "","AB","Aberdeen",..: 44 6 6 30 69 108 173 173 229 148 ...
## $ birthCity : Factor w/ 4714 levels "","Aberdeen",..: 1093 2718 2718 3092 3159 2212 3279 2291 1337 1382 ...
## $ deathYear : int NA NA 1984 NA NA NA 1905 1957 1962 1926 ...
## $ deathMonth : int NA NA 8 NA NA NA 5 1 6 4 ...
## $ deathDay : int NA NA 16 NA NA NA 17 6 11 27 ...
## $ deathCountry: Factor w/ 24 levels "","American Samoa",..: 1 1 22 1 1 1 22 22 22 22 ...
## $ deathState : Factor w/ 93 levels "","AB","AK","AL",..: 1 1 26 1 1 1 57 25 88 12 ...
## $ deathCity : Factor w/ 2554 levels "","Aberdeen",..: 1 1 91 1 1 1 1735 758 462 1984 ...
## $ nameFirst : Factor w/ 2313 levels "","A. J.","Aaron",..: 513 918 2087 599 73 775 1148 649 158 335 ...
## $ nameLast : Factor w/ 9713 levels "Aardsma","Aaron",..: 1 2 2 3 4 4 5 6 7 7 ...
## $ nameGiven : Factor w/ 12437 levels "","A. Harry",..: 2491 5099 11411 2886 3732 3766 6604 3201 954 1714 ...
## $ weight : int 220 180 190 190 184 220 192 170 175 169 ...
## $ height : int 75 72 75 75 73 73 72 71 71 68 ...
## $ bats : Factor w/ 4 levels "","B","L","R": 4 4 4 4 3 3 4 4 4 3 ...
## $ throws : Factor w/ 3 levels "","L","R": 3 3 3 3 2 2 3 3 3 2 ...
## $ debut : Factor w/ 10037 levels "","1871-05-04",..: 8636 4698 5145 6222 8419 9383 106 1132 875 934 ...
## $ finalGame : Factor w/ 9029 levels "","1871-05-05",..: 8991 5859 5560 6745 7984 9028 103 1851 1078 1110 ...
## $ retroID : Factor w/ 18793 levels "","aardd001",..: 2 3 4 5 6 7 8 9 10 11 ...
## $ bbrefID : Factor w/ 18846 levels "","aardsda01",..: 2 3 4 5 6 7 8 9 10 11 ...
str(player_salary)
## 'data.frame': 25575 obs. of 5 variables:
## $ yearID : int 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
## $ teamID : Factor w/ 35 levels "ANA","ARI","ATL",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lgID : Factor w/ 2 levels "AL","NL": 2 2 2 2 2 2 2 2 2 2 ...
## $ playerID: Factor w/ 4963 levels "aardsda01","aasedo01",..: 229 288 324 689 795 805 1090 1447 1529 1854 ...
## $ salary : int 870000 550000 545000 633333 625000 800000 150000 483333 772000 250000 ...
str(batters)
## 'data.frame': 101332 obs. of 22 variables:
## $ playerID: chr "abercda01" "addybo01" "allisar01" "allisdo01" ...
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : chr "TRO" "RC1" "CL1" "WS3" ...
## $ lgID : chr NA NA NA NA ...
## $ G : int 1 25 29 27 25 12 1 31 1 18 ...
## $ AB : int 4 118 137 133 120 49 4 157 5 86 ...
## $ R : int 0 30 28 28 29 9 0 66 1 13 ...
## $ H : int 0 32 40 44 39 11 1 63 1 13 ...
## $ X2B : int 0 6 4 10 11 2 0 10 1 2 ...
## $ X3B : int 0 0 5 2 3 1 0 9 0 1 ...
## $ HR : int 0 0 0 2 0 0 0 0 0 0 ...
## $ RBI : int 0 13 19 27 16 5 2 34 1 11 ...
## $ SB : int 0 8 3 1 6 0 0 11 0 1 ...
## $ CS : int 0 1 1 1 2 1 0 6 0 0 ...
## $ BB : int 0 4 2 0 2 0 1 13 0 0 ...
## $ SO : int 0 0 5 2 1 1 0 1 0 0 ...
## $ IBB : int NA NA NA NA NA NA NA NA NA NA ...
## $ HBP : int NA NA NA NA NA NA NA NA NA NA ...
## $ SH : int NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ GIDP : int NA NA NA NA NA NA NA NA NA NA ...
str(fielding)
## Classes 'tbl_df', 'tbl' and 'data.frame': 170526 obs. of 18 variables:
## $ playerID: chr "abercda01" "addybo01" "addybo01" "allisar01" ...
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : chr "TRO" "RC1" "RC1" "CL1" ...
## $ lgID : chr NA NA NA NA ...
## $ POS : chr "SS" "2B" "SS" "2B" ...
## $ G : int 1 22 3 2 29 27 1 2 20 5 ...
## $ GS : chr NA NA NA NA ...
## $ InnOuts : chr NA NA NA NA ...
## $ PO : int 1 67 8 1 51 68 7 3 38 10 ...
## $ A : int 3 72 14 4 3 15 0 4 52 0 ...
## $ E : int 2 42 7 0 7 20 0 1 28 8 ...
## $ DP : int 0 5 0 0 1 4 0 0 2 0 ...
## $ PB : int NA NA NA NA NA 0 NA NA NA 0 ...
## $ WP : chr NA NA NA NA ...
## $ SB : chr NA NA NA NA ...
## $ CS : chr NA NA NA NA ...
## $ ZR : chr NA NA NA NA ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 18
## .. ..$ playerID: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ yearID : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ stint : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ teamID : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ lgID : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ POS : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ G : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ GS : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ InnOuts : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ PO : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ A : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ E : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ DP : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ PB : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ WP : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ SB : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ CS : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ ZR : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
attendance <- subset(attendance, attendance != 'NA')
attendance <- subset(attendance, yearID > 1969)
pitching <- subset(pitching, pitching !='NA')
pitching[ , colSums(is.na(pitching)) == 0]
## data frame with 0 columns and 1197590 rows
batters <- subset(batters, batters !='NA')
batters$SF <- as.integer(batters$SF)
player_info <- subset(player_info, player_info != 'NA')
fieldING <- subset(fielding, fielding != 'NA')
fielding = subset(fielding, yearID > 1920)
fieldING <- subset(fielding, E != 'NA')
names(player_salary) <- c("yearID","teamID","lgID","playerID","salary")
batters <- dplyr::inner_join(batters,player_salary,by=c("playerID","teamID","lgID","yearID"))
## Warning: Column `playerID` joining character vector and factor, coercing
## into character vector
## Warning: Column `teamID` joining character vector and factor, coercing into
## character vector
## Warning: Column `lgID` joining character vector and factor, coercing into
## character vector
pitching <- dplyr::inner_join(pitching,player_salary,by=c("playerID","teamID","lgID","yearID"))
## Warning: Column `playerID` joining factors with different levels, coercing
## to character vector
## Warning: Column `teamID` joining factors with different levels, coercing to
## character vector
## Warning: Column `lgID` joining factors with different levels, coercing to
## character vector
batters <- dplyr::anti_join(batters,pitching,by="playerID")
ggplot(attendance, aes(yearID, attendance)) + geom_bar(stat="identity", aes(fill=G>157))+
guides(fill=FALSE)+
ggtitle("Total Home Attendance on an Annual Basis (Lockouts in 72, 81, 94, and 95)")+
scale_y_continuous(labels = comma) +
theme_minimal()
product <- ggplot(attendance, aes(W, log10(attendance))) + geom_point(color="darkblue") +
labs(x="Wins", y="Attendance") +
scale_y_continuous(labels = comma) +
ggtitle("Wins Based on Attendance") +
geom_smooth(method = "lm", col = "brown")
ggplotly(product)
hrattn <- ggplot(attendance, aes(HR, attendance)) + geom_point(color="brown") +
labs(x="Home Runs", y="Attendance") +
scale_y_continuous(labels = comma) +
ggtitle("Home Runs Based on Attendance")+
geom_smooth(method = "lm", col = "orange")
ggplotly(hrattn)
Despite winning 67 games in the 1993 season for Colorado Rockies, it has attracted about 4.5 million attendance. However, it was the first season into the franchise, so there has to be a fanbase for the team in order to attract many people to come. In comparison to the other end, there are some bad teams who’s seasons and attendance were low during the lockout seasons. The 1981 Chicago Cubs were stopped at 38-65 record with an attendance of 565,637 before the season was cut short. Many outliers on this graph. With the regression line, it has been determined that there are more data points being above the threshold.
lossatn <- ggplot(attendance, aes(L, attendance)) + geom_point(color="magenta") + labs(x="Losses", y="Attendance") +
scale_y_continuous(labels = comma) +
ggtitle("Losses Based on Attendance")+
geom_smooth(method = "lm", col = "green") +
theme_minimal()
ggplotly(lossatn)
It seems that around 95 losses, there must be a lot of attendance, which is interesting. It does trace back to the 1993 season with Colorado Rockies. It is possible that it could be one of the well-known franchise teams that must have a bad season that is marred by many factors such as injuries, personal, or team conflict. Although, it is hard to predict. Lot more points above the threshold with the errors to attendance ratio. It has been determined that it is not a good model since r-squared is heading to the negative distribution.
errors <- ggplot(attendance, aes(E, log10(attendance))) + geom_point(color="darkred") + labs(x="Errors", y="Attendance") +
scale_y_continuous(labels = comma) +
ggtitle("Attendance Based on Errors")+
geom_smooth(method = "lm", col = "mistyrose")
ggplotly(errors)
salary <- select(player_salary, yearID, teamID, salary)
salary <- group_by(salary, yearID, teamID) %>% summarise(dollars = sum(salary))
team_salary <- left_join(attendance, salary)
## Joining, by = c("yearID", "teamID")
## Warning: Column `teamID` joining factors with different levels, coercing to
## character vector
team_salary <- subset(team_salary, yearID > 1984) #salary data starts at 1984
If any of us are believing that the money is paid for all the high-performing athletes, it isn’t necessarily the case. It is not drawn based on attendance either. There are many factors that determine the high pay for attracting the best talent. It could be based on their performance on defense as well. There are many examples to determine it.
ggplot(team_salary, aes(dollars, attendance)) + geom_point(color="darkgreen",size=4, shape=36) +
labs(x="Team Salary", y="Attendance") +
scale_y_continuous(labels = comma) +
scale_x_continuous(labels = comma) +
ggtitle("Team Salary & Attendance")+
geom_smooth(method = "lm", col = "blue")
In this visual, it has been seen that for the maximum attendance that is around 4.5 million people, there has been a lower team salary, probably because the franchise team had a bad season occurring in factors such as the team management or injuries. For the maximum team salary, it turns out that there is at least an attendance of over 3 million people.
attendance$RD <- attendance$R - attendance$RA
ggplot(data = attendance) + geom_smooth(mapping = aes(x = W, y = RD), color="darkblue") +
ggtitle('Relationship Between Wins and Run Difference') + labs(x="Wins", y="Runs Difference")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
wins_reg <- lm(W ~ RD, data = attendance)
summary(wins_reg)
##
## Call:
## lm(formula = W ~ RD, data = attendance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.709 -2.045 0.919 3.896 13.882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.558544 0.187490 424.33 <2e-16 ***
## RD 0.102752 0.001871 54.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.666 on 1262 degrees of freedom
## Multiple R-squared: 0.705, Adjusted R-squared: 0.7048
## F-statistic: 3017 on 1 and 1262 DF, p-value: < 2.2e-16
hist(rnorm(wins_reg$residuals), col = "red", main = "Wins for Run Difference", xlab = "Residuals")
qqnorm(wins_reg$residuals)
qqline(wins_reg$residuals, col = "green")
winsregrx <- rnorm(wins_reg$residuals)
fitdistr(winsregrx, "normal")
## mean sd
## -0.01071004 1.02530327
## ( 0.02883891) ( 0.02039219)
wrfit <- fitdistr(winsregrx, "normal")
wrfit$estimate
## mean sd
## -0.01071004 1.02530327
sum(wins_reg$residuals)/length(wins_reg$residuals)
## [1] -1.695198e-17
wrnd <- rnorm(wins_reg$residuals)
shapiro.test(wrnd)
##
## Shapiro-Wilk normality test
##
## data: wrnd
## W = 0.99819, p-value = 0.1919
Based on the information between the Wins and Runs Difference, this is the equation that is came up for the linear regression mode:
“w = 78.957265 + 0.102115(RD) = 78.957265 + 0.102115(135) ~ 93
It was predicted at 93 runs given in 2002 for the team concentrated on the Oakland A’s performance. It is still one of the predictors in the baseball statistics. In addition with the qqplots and histograms, it turns out that there is a strong correlation to its sample as well on top of predictions. With the p-value predicted that is > 0.05, we can fail to reject the null hypothesis. Sample mean is not equal to zero. The Shapiro-Wilk test shows that we could reject the null hypothesis as to the t-test.
runs_scored <- lm(R ~ AB + H + X2B + X3B + HR + BB + SB + CS, data = attendance)
summary(runs_scored)
##
## Call:
## lm(formula = R ~ AB + H + X2B + X3B + HR + BB + SB + CS, data = attendance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.974 -14.334 -0.169 14.181 75.824
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.418867 9.469745 0.361 0.7181
## AB -0.108004 0.003312 -32.615 < 2e-16 ***
## H 0.627810 0.012241 51.287 < 2e-16 ***
## X2B 0.199960 0.024100 8.297 2.72e-16 ***
## X3B 0.592950 0.072481 8.181 6.85e-16 ***
## HR 0.956610 0.022378 42.749 < 2e-16 ***
## BB 0.340556 0.009755 34.911 < 2e-16 ***
## SB 0.188601 0.021662 8.707 < 2e-16 ***
## CS -0.096706 0.055403 -1.745 0.0811 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.45 on 1255 degrees of freedom
## Multiple R-squared: 0.9491, Adjusted R-squared: 0.9488
## F-statistic: 2926 on 8 and 1255 DF, p-value: < 2.2e-16
runs_scored <- lm(R ~ AB + H + X2B + X3B + HR + BB + SB + CS, data = attendance, family = "binomial")
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
## extra argument 'family' will be disregarded
summary(runs_scored)
##
## Call:
## lm(formula = R ~ AB + H + X2B + X3B + HR + BB + SB + CS, data = attendance,
## family = "binomial")
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.974 -14.334 -0.169 14.181 75.824
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.418867 9.469745 0.361 0.7181
## AB -0.108004 0.003312 -32.615 < 2e-16 ***
## H 0.627810 0.012241 51.287 < 2e-16 ***
## X2B 0.199960 0.024100 8.297 2.72e-16 ***
## X3B 0.592950 0.072481 8.181 6.85e-16 ***
## HR 0.956610 0.022378 42.749 < 2e-16 ***
## BB 0.340556 0.009755 34.911 < 2e-16 ***
## SB 0.188601 0.021662 8.707 < 2e-16 ***
## CS -0.096706 0.055403 -1.745 0.0811 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.45 on 1255 degrees of freedom
## Multiple R-squared: 0.9491, Adjusted R-squared: 0.9488
## F-statistic: 2926 on 8 and 1255 DF, p-value: < 2.2e-16
hist(rnorm(runs_scored$residuals), col = "blue", main = "Runs Scored Residuals", xlab = "Residuals")
qqnorm(runs_scored$residuals)
qqline(runs_scored$residuals, col = "salmon")
runsscoredrx <- rnorm(runs_scored$residuals)
fitdistr(runsscoredrx, "normal")
## mean sd
## -0.0001699746 0.9689010446
## ( 0.0272524711) ( 0.0192704071)
rsfit <- fitdistr(runsscoredrx, "normal")
rsfit$estimate
## mean sd
## -0.0001699746 0.9689010446
sum(runs_scored$residuals)/length(runs_scored$residuals)
## [1] -7.358302e-17
srnd <- rnorm(runs_scored$residuals)
shapiro.test(srnd)
##
## Shapiro-Wilk normality test
##
## data: srnd
## W = 0.99918, p-value = 0.8667
srtest <- t.test(srnd)
srtest$p.value
## [1] 0.2871109
By comparing the differences utilizing both the linear and logistic regression models, the results turn out to be the same. R-squared equated to 0.9483, which is a really good model. It turns out that the equation from the model equates to:
= 5.177834 + (-0.105735 * AB) + (0.621239 * H) + (0.207906 * X2B) + (0.544204 * X3B) + (0.944925* HR) + (0.341597 * BB) + (0.262176 * SB) + (-0.274041 * CS)
= 5.177834 + (-0.105735 * 5558) + (0.621239 * 1450) + (0.207906 * 279) + (0.544204 * 28) + (0.944925* 205) + (0.341597 * 609) + (0.262176 * 46) + (-0.274041 * 20) = 799.8642
r ~ 800
It turns out that it is coming close to 800 for Oakland A’s in 2002.
Also, by running the statistical tests, it has been proven that we can fail to reject the null hypothesis again as p is > 0.05. Sample mean is not equated to zero for the scored residuals. The Shapiro-Wilk test shows that we could also fail the reject the null hypothesis again as to the t-test.
runs_allowed <- lm(R ~ SV + IPouts + H + ER + HR + BB + SO + WP + HBP + BK, data = pitching)
summary(runs_allowed)
##
## Call:
## lm(formula = R ~ SV + IPouts + H + ER + HR + BB + SO + WP + HBP +
## BK, data = pitching)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9190 -1.4762 -0.2931 1.1428 19.6198
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0257224 0.0457194 -0.563 0.5737
## SV -0.0057003 0.0031700 -1.798 0.0722 .
## IPouts -0.0089103 0.0009841 -9.054 < 2e-16 ***
## H 0.0911030 0.0032199 28.293 < 2e-16 ***
## ER 0.9116464 0.0048153 189.324 < 2e-16 ***
## HR 0.0186677 0.0078987 2.363 0.0181 *
## BB 0.0576205 0.0027694 20.806 < 2e-16 ***
## SO -0.0103622 0.0014128 -7.335 2.37e-13 ***
## WP 0.0720413 0.0107940 6.674 2.60e-11 ***
## HBP 0.0489909 0.0109970 4.455 8.47e-06 ***
## BK 0.1631391 0.0263168 6.199 5.87e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.66 on 11520 degrees of freedom
## Multiple R-squared: 0.9937, Adjusted R-squared: 0.9937
## F-statistic: 1.814e+05 on 10 and 11520 DF, p-value: < 2.2e-16
hist(rnorm(runs_allowed$residuals), col = "yellow", main = "Runs Allowed Residuals", xlab = "Residuals")
qqnorm(runs_allowed$residuals, col = "green")
qqline(runs_allowed$residuals, col = "darkred")
raregrx <- rnorm(runs_allowed$residuals)
fitdistr(raregrx, "normal")
## mean sd
## -0.021895438 0.998427747
## ( 0.009297863) ( 0.006574582)
rafit <- fitdistr(raregrx, "normal")
rafit$estimate
## mean sd
## -0.02189544 0.99842775
sum(runs_allowed$residuals)/length(runs_allowed$residuals)
## [1] -2.570166e-16
testwr <- t.test(runs_allowed$residuals)
testwr$p.value
## [1] 1
By having this equation:
= 0.036056 + (-0.014268 * SV) + (-0.011360 * IPouts) + (0.102181 * H) + (0.896016 * ER) + (0.026607 * HR) + (0.062142 * BB) + (-0.007035 * SO) + (0.075391 * WP) + (0.078287 * HBP) + (0.149150 * BK)
= 0.036056 + (-0.014268 * 48) + (-0.011360 * 4356) + (0.102181 * 1450) + (0.896016 * 593) + (0.026607 * 205) + (0.062142 * 609) + (-0.007035 * 1008) + (0.075391 * 2.353) + (0.078287 * 68) + (0.149150 * 0.5294)
~ 671
By projecting that Oakland A’s would allow 671 runs, however, it has totaled out to 654, which is still accurate for the 2002 season.
Many teams in the Major League baseball concentrated on Batting Average (BA) too much.Oakland A’s discovered that On-Base Percentage (OBP) and Slugging Percentage are the top category for the most important statistical measure. However, we are going to see if that holds true.
Based on the analysis for the p-value and using the t-test, it has proven that we could fail to reject the null hypothesis for the runs allowed variable. Sample mean is not zero at all.
get_wins <- attendance
get_wins$win_ratio <- get_wins$W/get_wins$G
oakland <- subset(get_wins, teamID == 'OAK' & yearID >= 2002)
athletics <- ggplot(data = oakland) +
geom_line(mapping = aes(x = yearID, y = win_ratio), color="darkseagreen") +
ggtitle("Oakland A's Performance During 2002-2016") +
labs(x="year", y="win-ratio") +
scale_x_continuous(breaks = round(seq(min(oakland$year), max(oakland$year), by = 1),1))
ggplotly(athletics)
In this part, the model runs at r-squared that is 0.99, which it is almost perfect when there are more than two variables that are described in analyzing the wins, runs, and scores. By comparison with the first two regression models, the runs scored is also works and will be used for analysis, which is at 0.94. However, looking at the first model, the fit is only OK given that r-squared is at 0.71, and it does not always have valid information to make decisions by looking at two variables focusing only on wins attribute. Proven that focusing at least on 3 or more variables would give a better description on making the analyses on the model.
test<- subset(attendance, yearID >= 1985 & yearID < 2002)
test$obp <- (test$H + test$HBP + test$BB)/(test$AB + test$BB + test$HBP + test$SF)
test$slg <- ((test$BB) + (2 * test$X2B) + (3 * test$X3B) + (4 * test$HR))/(test$AB)
test$ba <- (test$H)/(test$AB)
ggplot(test, aes(W)) +
geom_smooth(aes(y = obp, color="obp")) +
geom_smooth(aes(y = slg, color="slg")) +
geom_smooth(aes(y = ba, color="ba")) + labs(x="Wins", y="%") +
ggtitle("Comparison of OBP, SLG and BA")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 408 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
From the information that is gathered in the above plot, it is deemed that the On-Base Percentage (OBP) and Slugging Percentage (SLG) statistics are more important based on the earlier prediction on the analysis.
p <- ggplot(player_salary, aes(x=factor(yearID), y=log(salary), fill = lgID)) +
geom_boxplot() +
xlab("Year") +
ylab("Annual Salary in Millions") +
ggtitle("Breakdown with the League Differences in Salaries from 1985") +
scale_x_discrete(breaks=unique(player_salary$yearID)[c(TRUE, FALSE)]) +
scale_y_continuous(label=dollar) +
theme_light(base_size=10)
print(p)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
For the salaries comparison between the leagues, it has turned out that there has been equal distributions overtime adjusted for inflation. However, it turns out that American League has been getting the higher pay for the last ten years except in 2014.
homeruns_by_year <- batters %>%
group_by(yearID) %>%
summarise(HR = sum(HR))
p1 <- ggplot(homeruns_by_year, aes(x=yearID, y=HR)) +
geom_point() +
xlab("Year") +
ylab("Home Runs") +
scale_x_discrete(breaks=unique(homeruns_by_year$yearID)[c(TRUE, FALSE)]) +
scale_y_continuous() +
theme_light(base_size=10) +
ggtitle("Batters Overtime")
ggplotly(p1)
hr_since50 <- batters %>% filter(yearID>=1950) %>% select(playerID,yearID,stint,HR);
hr_season_since50 <- hr_since50 %>% group_by(playerID, yearID) %>% summarise(seasonHR=sum(HR));
hr_season_since50 <- hr_season_since50 %>% group_by(playerID) %>% mutate(careerHR=sum(seasonHR),career_year=n()) %>% filter(career_year>=5);
yearly_hr_vs_career_avg <- hr_season_since50 %>% mutate(career_per_season=careerHR/career_year, year_vs_avg=seasonHR/career_per_season);
hr_fluke_ratio_and_age <- inner_join(yearly_hr_vs_career_avg, player_info, by="playerID") %>%
mutate(age=yearID-birthYear)%>%
select(playerID, nameFirst, nameLast, yearID, age, seasonHR, careerHR, career_year, career_per_season, year_vs_avg) %>%
ungroup();
## Warning: Column `playerID` joining character vector and factor, coercing
## into character vector
out<-filter(hr_fluke_ratio_and_age, seasonHR>=10, year_vs_avg>=2) %>% arrange(desc(year_vs_avg));
datatable(out, class='compact');
out2 <- out %>% arrange(desc(seasonHR));
datatable(out2, class='compact');
onehit <- ggplot(out, aes(age,year_vs_avg)) +
geom_point(color = "blue", alpha = 0.4) +
labs(x="Player Age", y="Season HR per Career Average") +
ggtitle("Players With One Season HR More Than 2x Career Average") +
theme_minimal();
ggplotly(onehit)
Based on the visualization, it turns out that the late 20s to the mid 30s are the players that have had a great season for the one-hit wonders. The data is compared from the Home Run per season vs. the career. Outliers happen more after the late to mid 30s.
options(dplyr.width = Inf)
pitching <- pitching %>% filter(yearID>=1985) %>%
select(playerID,yearID,stint,W,L,G,GS,GF,SV,IPouts,H,BB,ER,SO,R)
pitching <- pitching %>%
group_by(playerID, yearID) %>%
summarise(W=sum(W), L=sum(L), G=sum(G), GS=sum(GS),GF=sum(GF),SV=sum(SV),IPouts=sum(IPouts),H=sum(H),BB=sum(BB),ER =sum(ER),SO=sum(SO),R=sum(R)) %>%
ungroup();
rp_pitching <- pitching %>% filter(GS==0);
player_info <- player_info %>%
separate(debut, c("debut"), sep="-") %>%
select(playerID, birthYear, nameFirst, nameLast, debut)
## Warning: Too many values at 18653 locations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
## 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...
rp_pitching <- inner_join(rp_pitching, player_info, by="playerID")%>%
mutate(id_year=paste(playerID, yearID, sep="_"))
## Warning: Column `playerID` joining character vector and factor, coercing
## into character vector
player_salary <- player_salary %>%
mutate(id_year = paste(playerID, yearID, sep="_")) %>%
select(id_year,salary)
rp_pitching <- inner_join(rp_pitching, player_salary, by="id_year");
rp_pitching<-rp_pitching %>%
mutate(gf_ratio=GF/162, gf_percent=GF/G)
rp_pitching<-rp_pitching %>%
group_by(yearID) %>%
mutate(season_avg_salary=mean(salary)) %>%
ungroup() %>%
mutate(salary_ratio=salary/season_avg_salary, season=yearID-as.numeric(debut), whip=(H+BB)/IPouts*3, era=ER/IPouts*27);
rp_pitching<-do.call(data.frame,lapply(rp_pitching, function(x) replace(x, is.infinite(x),NA)))
cor(rp_pitching$salary_ratio, rp_pitching$whip,use="pairwise.complete")
## [1] -0.1012409
cor(rp_pitching$salary_ratio, rp_pitching$era,use="pairwise.complete")
## [1] -0.09562511
closers<- rp_pitching %>% filter(SV > 15);
nonclosers<- rp_pitching %>% filter(SV <= 15);
fa_closers <- closers %>% filter(season>=7&IPouts>=150);
fa_nonclosers <- nonclosers %>% filter(season>=7&IPouts>=150);
nrow(fa_closers)
## [1] 286
nrow(fa_nonclosers)
## [1] 633
Based on this threshold, we have 315 FA closer seasons vs. 715 FA Non-Closer seasons. It has been determined that there needs to be more analysis by changing the limits on more than one save per season.
boxplot(fa_closers$gf_percent*100, fa_nonclosers$gf_percent*100,
names=c("Closers", "Non-Closers"),
ylab="% of Finishing a Game (Being Last Pitcher In Game)",
main = "Closers vs. Non-Closers",
col = "lightcoral")
summary(fa_closers$gf_percent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3544 0.8143 0.8646 0.8432 0.9000 1.0000
summary(fa_nonclosers$gf_percent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05263 0.20270 0.28750 0.30963 0.39216 0.88095
By looking at the boxplots, closers finished the game 83% of the time, while the non-closers only finish in the 31% of the time. The information is based on the difference with 15 or more saves per season.
plot(density(fa_closers$gf_percent*100), xlim=c(0,100), col="blue", xlab="% of Finishing a Game (Being Last Pitcher in Game)", main="")
lines(density(fa_nonclosers$gf_percent*100), col="darkred")
legend(0,0.06,c("Non-Closers", "Closers"),c("darkred", "blue"))
summary(fa_closers$whip)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6069 1.0307 1.1646 1.1791 1.3306 1.8068
summary(fa_nonclosers$whip)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.773 1.189 1.305 1.315 1.432 1.962
summary(fa_closers$salary_ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2226 2.0451 3.1861 3.4617 4.5667 9.0269
summary(fa_nonclosers$salary_ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1409 0.6725 1.2106 1.3963 1.9288 8.3391
boxplot(fa_closers$whip, fa_nonclosers$whip,
names=c("Closers", "Non-Closers"), ylab="WHIP (Hit+Walk per Inning)",
main = "Difference of WHIP",
col = "darkolivegreen")
In this analysis on the WHIP difference, it turns out that the non-closers have made more of an impact than the closers. Although, when it comes to the salary ratio, the closers get paid more, particularly the relief pitchers by difference of per inning.
boxplot(fa_closers$salary_ratio, fa_nonclosers$salary_ratio,
names=c("Closers", "Non-Closers"), ylab="Salary / Avg. Relief Pitcher Salary",
main = "Salary Ratio on WHIP Differences",
col = "goldenrod")
plot(fa_closers$salary_ratio,fa_closers$whip,
xlab="Salary / Avg. Relief Pitcher Salary", ylab="WHIP (Hit+Walk per Inning)",
main = "Difference Between the Salary Ratio on Relief Pitchers",
xlim=c(0,10), ylim=c(0.6,2.0), col = "green", pch = 19)
points(fa_nonclosers$salary_ratio, fa_nonclosers$whip, col=2, pch = 19)
legend(7,2.0,c("Closers", "Non-Losers"),c("green", "red"))
low_whip_closers<-fa_closers %>% filter(whip<=1.20)
low_whip_nonclosers<-fa_nonclosers %>% filter(whip<=1.20)
nrow(low_whip_closers)
## [1] 164
nrow(low_whip_nonclosers)
## [1] 166
By looking at the scatterplot of the relief pitcher salary, it has been determined that the closers have a migh higher pay based on the green dots that are in the right side of the distribution of the analysis. There are more red dots on the left, and at most get up to the middle of the distribution. The one outlier for the non-closer must have to be the best player in the league in order to achieve that status to receive the highest pay in the league.
boxplot(low_whip_closers$whip, low_whip_nonclosers$whip,
names=c("Closers", "Non-Closers"), ylab="WHIP (Hit+Walk per Inning)",
main = "Low WHIP Non-Closers",
col = "skyblue")
boxplot(low_whip_closers$salary_ratio, low_whip_nonclosers$salary_ratio,
names=c("Closers", "Non-Closers"), ylab="Salary / Avg. Relief Pitcher Salary",
main = "Difference with Relief Pitchers",
col = "maroon")
plot(low_whip_closers$salary_ratio,low_whip_closers$whip,
xlab= "Salary / Avg. Relief Pitcher Salary", ylab="WHIP (Hit+Walk per Inning)",
main = "Difference of Salaries",
xlim=c(0,10), ylim=c(0.6,1.3), col = "blue", pch = 16)
points(low_whip_nonclosers$salary_ratio, low_whip_nonclosers$whip, col=2, pch = 16)
legend(7,1.3,c("Closers", "Non-Losers"),c("blue", "red"))
Again, by looking at the plots again, it has been determined that the closing relief pitchers have been making more closing moments in the game headed towards the right in the distribution points that are shaded in blue. Same comparison as with the previous visuals with the walks per hit and inning ratio.
low_era_closers<-fa_closers %>% filter(era<=3.0)
low_era_nonclosers<-fa_nonclosers %>% filter(era<=3.0)
nrow(low_era_closers)
## [1] 161
nrow(low_era_nonclosers)
## [1] 172
boxplot(low_era_closers$era, low_era_nonclosers$era,
names=c("Closers", "Non-Closers"), ylab="ERA (ER per 9 Innings)",
main = "Comparisons with ERA on Closers vs. Non-Closers",
col = "red")
By looking at the boxplots again for closers and non-closers, it has seemed that the closers have made more salary when the game goes down the wire by playing all 9 innings and closing down the game. Again, this is the impact with the relief pitchers.
boxplot(low_era_closers$salary_ratio, low_era_nonclosers$salary_ratio,
names=c("Closers", "Non-Closers"), ylab="Salary / Avg. Relief Pitcher Salary",
main = "Salary Ratio Between the Closers vs. Non-Closers",
col = "khaki")
plot(low_era_closers$salary_ratio,low_era_closers$era,
xlab="Salary / Avg. Relief Pitcher Salary", ylab="ERA (ER per 9 Innings)",
main = "Salaries for ERA Comparisons",
xlim=c(0,10), ylim=c(0.6,3.0), col = "gold", pch = 16)
points(low_era_nonclosers$salary_ratio, low_era_nonclosers$era, pch = 16, col = "green")
legend(7,1.0,c("Closers", "Non-Losers"),c("gold", "green"))
For the concluding analyses on the closers, it turns out that the closers make more money than the non-closers based on the given performance on having more saves to end the game. However, it is important to secure leads during late in the game to earn victories, and not playing extra innings.
errors_by_year = tapply(fieldING$E, fieldING$yearID, sum)
years = factor(fieldING$yearID)
years = levels(years)
ebyt2 = cbind(errors_by_year, years)
ebyt2 = data.frame(ebyt2)
ggplot(ebyt2, aes(years, errors_by_year)) +
geom_bar(stat = 'identity', col = "darkred", fill = "skyblue") +
labs(title="Errors By Year Since 1920", x = "Years", y = "Errors By Year") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_discrete(breaks = pretty(errors_by_year, n=10))
The data is all chronological for the errors that are done based on the r console. However, the focus is going to look at 1920 to present, since the analyses before is not as much of a factor for errors.
atopper = ddply(fieldING, "POS", summarise, errors_by_position = sum(E))
ggplot(atopper, aes(POS, errors_by_position)) +
geom_bar(stat='identity', fill = "red") +
labs(title="Errors by Position Since 1920", x = "Position", y = "Total # of Errors by Position")
Based on position for errors in baseball, it turns out that Shortstop makes the most errors in the game. It is not a surprise as they could miss catches, since they are at a closer to the 2nd Base and a distance from 3rd as well. Possibilities are also the miss throws when the receiver couldn’t catch the ball, also dropped balls as well.
btopper = ddply(fieldING, "teamID", summarise, errors_by_team = sum(E))
ggplot(btopper, aes(teamID, errors_by_team)) +
geom_bar(stat='identity', fill = 'violet') +
labs(title = "Errors by Team Since 1920", x = "Team", y = "Total # of Errors")
I have learned the importance of tackling datasets that involves with having multiple spreadsheets from the information that is given. Baseball is a sport that can have unpredictability when it comes to the closers, where they have made more of an impact of winning for the team.I have taken the opportunity to expand the visuals and to implement more statistical practices on using the dataset. Utilized cleaning on the datasets along the way, and not doing it all at one setting. By visualizing the data from the past, it turns out that it is possible to look at some statistics from the beginning. Most of it has more of an impact in the last 30 years when it comes to performance and salaries. Also, would want to have better identification on which model would be the best for further analyses.
Would want to put it on a Shiny App if possible. Take more of different methods from what I have learned and to go beyond. Also would do differences with the interaction with the regression models that I have created as well.